## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
The dataset has 1599 rows and 13 columns.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our target variable is quality, which has a minimum of 3, a maximum of 8 and a mean of 5.636.
Now we will explore each feature independently.
Quality is our target variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
quality ranges from 3 to 8 with a mean of 5.636.
Let’s draw a histogram of quality.
Most of the wines have quality 5 or 6. Very few wines have quality 3 or 8. What percentage of wines have quality 5 or 6?
## [1] 0.8248906
About 82.5% of the wines have quality 5 or 6.
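The share can be computed as the mean of a logical vector. This is only a minimal sketch: it uses the first ten quality values shown in the str() output above, so it returns 0.8 rather than the full-data 0.8249, and the name quality_sample is hypothetical.

```r
# First ten quality values from the str() output (illustrative subset, not the full data).
quality_sample <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# %in% gives TRUE for wines rated 5 or 6; the mean of a logical vector is the proportion of TRUEs.
share_5_or_6 <- mean(quality_sample %in% c(5, 6))
share_5_or_6  # 0.8 for this subset
```

On the full data frame the same expression would be applied to red_wine$quality.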
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
We can’t clearly see the pattern here. Let’s set a smaller binwidth to see the distribution clearly.
There is a big spike when citric.acid is 0.00. In total there are three major peaks, at 0.00, 0.25 and 0.50. After 0.50 the count starts to fall. How many wines have citric acid = 0?
## [1] 132
There are 132 wines that have 0 citric acid.
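Counting exact zeros is a one-line comparison. The sketch below uses only the first ten citric.acid values from the str() output, so it counts 5 rather than 132; citric_sample is a hypothetical name.

```r
# First ten citric.acid values from the str() output (illustrative subset).
citric_sample <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Comparing against 0 gives a logical vector; sum() counts the TRUE entries.
n_zero <- sum(citric_sample == 0)
n_zero  # 5 for this subset
```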
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
residual.sugar has extreme outliers. Let’s use a boxplot to explore this feature further.
The mean of residual.sugar is 2.539, but a few wines have values as high as 15.500. Most of the values are below 4.000.
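The points a boxplot draws beyond its whiskers can also be listed numerically with boxplot.stats(), which uses the same 1.5 x IQR rule. This is a sketch on the first ten residual.sugar values from the str() output, not the full column; sugar_sample is a hypothetical name.

```r
# First ten residual.sugar values from the str() output (illustrative subset).
sugar_sample <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# boxplot.stats() returns the five-number summary in $stats and, in $out,
# the values lying more than 1.5 times the hinge spread beyond the hinges.
bs <- boxplot.stats(sugar_sample)
bs$stats
bs$out  # 6.1 is the only flagged value in this subset
```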
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The mean is 46.47 and the maximum is 289.00.
This distribution is positively skewed. Most of the values are less than 160. Let’s log transform it.
After the log transform, the distribution appears approximately normal.
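A log-scaled histogram of this kind can be sketched as below. I am assuming ggplot2 here (the report mentions coord_cartesian and ggpairs, so ggplot2 seems likely), and the data are only the first ten total.sulfur.dioxide values from the str() output; red_wine_sample and the bin count are illustrative choices.

```r
library(ggplot2)

# First ten total.sulfur.dioxide values from the str() output (illustrative subset).
red_wine_sample <- data.frame(
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Histogram drawn on a log10 x axis; bins = 5 is an arbitrary choice for ten points.
p_log <- ggplot(red_wine_sample, aes(x = total.sulfur.dioxide)) +
  geom_histogram(bins = 5) +
  scale_x_log10()

# Numeric view of the same transform: the long right tail is pulled in.
log_tsd <- log10(red_wine_sample$total.sulfur.dioxide)
summary(log_tsd)
```

scale_x_log10() transforms the axis and keeps the tick labels in the original units, whereas log10() on the column changes the values themselves.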
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The histogram shows that pH is approximately normally distributed around a mean of 3.311, with most values between 3.0 and 3.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
There are some high values. Let’s trim the x axis.
Now the histogram looks approximately normal. Most of the values are between 0.35 and 0.95.
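Trimming the view without dropping data can be sketched with coord_cartesian. The data below are the first ten sulphates values from the str() output plus the maximum of 2.00 reported by summary(), added by me to stand in for an outlier; red_wine_sample, the binwidth and the x limits are illustrative assumptions.

```r
library(ggplot2)

# Ten sulphates values from the str() output, plus the reported maximum (2.00)
# appended as a stand-in outlier. Illustrative data only.
red_wine_sample <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80, 2.00)
)

# coord_cartesian() zooms the visible range; every row is still used for binning.
p_zoom <- ggplot(red_wine_sample, aes(x = sulphates)) +
  geom_histogram(binwidth = 0.05) +
  coord_cartesian(xlim = c(0.3, 1.0))
```

Unlike setting limits on the x scale, which removes out-of-range rows before the bins are computed, coord_cartesian only changes what is shown.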
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Let’s change the binwidth to get more clarity.
The distribution is positively skewed. Let’s log transform it.
The plot shows a peak at about 9.75.
A: There are 1599 observations of 13 variables: a row index X and 12 wine attributes (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”). quality is scored from 0 to 10 (worst to best). Most of the values fall between 5 and 7.
A: The quality of a wine is judged on its smell, taste and color, so we can roughly say that citric.acid, sulphates and alcohol influence quality.
A: I think acidity (fixed and volatile) and chlorides may also contribute to the quality of wine.
A: No, I didn’t create any new variables in the dataset.
A: I used coord_cartesian to limit the x axis for sulphates because there are some extreme outliers. The positively skewed histograms of total.sulfur.dioxide and alcohol were log transformed.
red_wine$quality.factor <- as.factor(red_wine$quality)
I am going to plot correlation plots of all features using ggpairs. For convenience I will do this in two steps.
Now we are going to explore the relationship between quality and the other features more closely. In these plots, I also plot the mean values inside the boxplots, which gives us an additional statistic to compare across quality levels.
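A boxplot with the group mean overlaid can be sketched as follows. I am assuming ggplot2, and the data are only the first ten quality and alcohol values from the str() output; red_wine_sample and the marker shape are illustrative. The fun argument of stat_summary needs ggplot2 3.3.0 or later (older versions call it fun.y).

```r
library(ggplot2)

# First ten quality and alcohol values from the str() output (illustrative subset).
red_wine_sample <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)
# Treat quality as a category so each score gets its own box.
red_wine_sample$quality.factor <- as.factor(red_wine_sample$quality)

# One box per quality level, with the group mean drawn as a cross.
p_box <- ggplot(red_wine_sample, aes(x = quality.factor, y = alcohol)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 3)

# The same group means as numbers.
group_means <- tapply(red_wine_sample$alcohol, red_wine_sample$quality.factor, mean)
group_means
```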
The figure shows a complex relationship between quality and fixed.acidity. There doesn’t seem to be a significant pattern between the two.
It seems that high quality wines have lower volatile.acidity levels.
The plot shows that high quality wines tend to have more citric acid. But there are some outliers: some wines of quality 7 have approximately 0.00 citric.acid, and one wine with citric.acid of 1.00 has quality 4.
There doesn’t seem to be a relationship between quality and residual.sugar.
Overall there is not much difference, but high quality wines contain fewer chlorides and low quality wines contain more.
We can’t infer anything about the relationship between these two.
As above, we can’t infer anything from the figure.
The plot shows that high quality wines tend to have lower density.
There doesn’t seem to be any significant difference in mean pH values across quality levels. So pH does not appear to be an important factor for quality.
High quality wines tend to have more sulphates.
High quality wines have higher alcohol levels.
We can observe in the density plots that wines of medium quality 5 more often fall into the range of low citric.acid, low sulphates and low alcohol than wines of quality 4 or 3. Red wines of quality 5 also occur quite often at high volatile acidity.
A: The plots of quality against the different features of red wine show that volatile.acidity, citric.acid, sulphates and alcohol are the features most clearly related to quality. Lower volatile.acidity, higher sulphates and higher alcohol go with higher quality. The plots also show that fixed.acidity, residual.sugar and pH are not important factors for wine quality.
A: According to the scatter plots of all features, fixed.acidity and citric.acid have a high positive correlation, whereas volatile.acidity and citric.acid have a high negative correlation. Even though free/total sulfur dioxide and sulphates all contain sulfur, there is no correlation between them, and their effects on red wine quality differ.
A: The correlation coefficients between quality and volatile.acidity, citric.acid, sulphates and alcohol are -0.3906, 0.2264, 0.2514 and 0.4762 respectively. The strongest relationship is between alcohol and quality.
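Coefficients like these come from cor(). The sketch below computes Pearson correlations with quality on the first ten rows shown in the str() output only, so its numbers will not match the full-data values quoted above; red_wine_sample and features are hypothetical names.

```r
# First ten rows of the relevant columns, taken from the str() output (illustrative subset).
red_wine_sample <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

features <- c("volatile.acidity", "citric.acid", "sulphates", "alcohol")

# Pearson correlation of each feature with quality, returned as a named vector.
cors <- sapply(red_wine_sample[features], function(x) cor(x, red_wine_sample$quality))
round(cors, 4)
```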
In the bivariate plotting section, we found that the four important features that influence quality are volatile.acidity, citric.acid, sulphates and alcohol. In this section we will examine the combined effect of these features on quality.
The above plots show that high alcohol and high sulphates both contribute strongly to high quality wine.
The above plots show that high volatile acidity tends to go with low quality wine, but for wines with citric acid between 0.75 and 1.0 the effect is insignificant.
The above plots show that high sulphates contribute to high quality across the different features. The exception is when citric acid is in the interval (0.75, 1), where an increase in sulphates goes with a drop in quality. Increasing sulphates raises quality most when alcohol is in the range (11.4, 12.9), compared to other ranges of alcohol.
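Intervals such as (0.55, 0.62) arise from cutting a continuous feature into buckets. The sketch below is one way to do that with cut(), assuming the breaks are the quartiles reported for sulphates by summary() above; whether the original plots used exactly these breaks is my assumption, and the data are only the first ten sulphates values from the str() output.

```r
# First ten sulphates values from the str() output (illustrative subset).
sulphates_sample <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)

# Min, quartiles and max of sulphates as reported by summary() on the full data.
breaks <- c(0.33, 0.55, 0.62, 0.73, 2.0)

# cut() assigns each value to an interval; include.lowest keeps the minimum in the first bucket.
sulphates.bucket <- cut(sulphates_sample, breaks = breaks, include.lowest = TRUE)
table(sulphates.bucket)
```

The resulting factor can then be used for faceting or colouring so that quality is compared within each bucket.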
We can build a linear model to predict the quality of wine using the above features.
##
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity +
## citric.acid, data = red_wine)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.71408 -0.38590 -0.06402  0.46657  2.20393
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)       2.64592    0.20106  13.160  < 2e-16 ***
## alcohol           0.30908    0.01581  19.553  < 2e-16 ***
## sulphates         0.69552    0.10311   6.746 2.12e-11 ***
## volatile.acidity -1.26506    0.11266 -11.229  < 2e-16 ***
## citric.acid      -0.07913    0.10381  -0.762    0.446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6588 on 1594 degrees of freedom
## Multiple R-squared: 0.3361, Adjusted R-squared: 0.3345
## F-statistic: 201.8 on 4 and 1594 DF, p-value: < 2.2e-16
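Linear regression is used to model the relationship between quality and the four features. The intercept is 2.646 and the coefficients for alcohol, sulphates, volatile.acidity and citric.acid are 0.30908, 0.69552, -1.26506 and -0.07913 respectively. The citric.acid coefficient is not statistically significant (p = 0.446), and the model explains about 34% of the variance in quality (R-squared 0.3361).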
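To see what the fitted equation does, we can apply the reported coefficients by hand to the first wine in the head() output. This is a worked sketch, not a refit of the model; the names coefs, wine and predicted are hypothetical.

```r
# Coefficients copied from the summary(lm) output above.
coefs <- c(
  "(Intercept)"    = 2.64592,
  alcohol          = 0.30908,
  sulphates        = 0.69552,
  volatile.acidity = -1.26506,
  citric.acid      = -0.07913
)

# First wine from the head() output: observed quality is 5.
wine <- c(alcohol = 9.4, sulphates = 0.56, volatile.acidity = 0.70, citric.acid = 0.00)

# Linear predictor: intercept plus the sum of coefficient * feature value.
predicted <- coefs[["(Intercept)"]] + sum(coefs[names(wine)] * wine)
round(predicted, 2)  # about 5.06, against an observed quality of 5
```

With the fitted model object, predict() on a one-row data frame would perform the same calculation.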
A: High alcohol contributes to high quality wine, and adding sulphates increases the quality further. Low volatile acidity also contributes to high quality wine. The other features only affect quality when volatile acidity is low. Sulphates contribute positively to quality, but in combination with other features there are exceptions, such as when alcohol is between 11.4 and 12.9 and when citric acid is between 0.75 and 1.
A: Individually, alcohol, volatile acidity, citric acid and sulphates each contribute to quality, but when combined with each other they do not behave as expected.
A: Yes, I created a linear model to predict quality based on alcohol, sulphates, volatile.acidity and citric.acid. This model uses the four features that most influence quality. However, it is a pretty basic model, and it needs to be revised to predict quality accurately.
A: Quality is our target feature. This histogram shows that the majority of wines have quality 5 or 6. To be precise, about 82% of the wines are rated 5 or 6.
A: The plot shows boxplots of quality against the four important features that influence it: volatile.acidity, citric.acid, sulphates and alcohol. Of these, volatile.acidity influences quality negatively whereas the other three influence it positively. Alcohol has the strongest correlation with quality: higher quality wines tend to have higher alcohol.
A: There is a strong relationship between quality and alcohol. Adding sulphates positively influences the quality of wine. When sulphates are in (0.73, 2), quality is high even when alcohol is between 11 and 13. Most of the wines with quality below 5 have sulphates in either (0.33, 0.55) or (0.55, 0.62). Even among wines with the same alcohol percentage, more sulphates go with noticeably higher quality.
The red wine dataset contains 1599 observations, each with 12 wine attributes plus a row index. Our target variable is quality, which has a mean of 5.636. First I plotted histograms of the features and observed that most of the wines have quality 5 or 6. After that I plotted scatter plots of all variables. quality correlates positively with alcohol (0.4762), sulphates (0.2514) and citric.acid (0.2264), and negatively with volatile.acidity (-0.3906).

I plotted boxplots of quality against all other features to explore the relationship between quality and the other variables. To see how the important features interact with each other, I drew multivariate plots after dividing these features into intervals. alcohol and sulphates both influence the quality of wine, but the other features show mixed results when combined with each other.

I created a linear model to predict the quality of wine from four important features: alcohol, sulphates, citric.acid and volatile.acidity. The model can be revised. We have a very limited dataset, and more data would likely improve the accuracy of the model. Other models, such as decision trees, random forests and boosting models, might also fit the data better.